Temporal Modeling Approaches for Large-scale Youtube-8M Video Understanding

Authors

  • Fu Li
  • Chuang Gan
  • Xiao Liu
  • Yunlong Bian
  • Xiang Long
  • Yandong Li
  • Zhichao Li
  • Jie Zhou
  • Shilei Wen
Abstract

This paper describes our solution for the video recognition task of the Google Cloud & YouTube-8M Video Understanding Challenge, which ranked 3rd. Because the challenge provides pre-extracted visual and audio features instead of raw videos, we mainly investigate temporal modeling approaches that aggregate frame-level features for multi-label video recognition. Our system contains three major components: a two-stream sequence model, a fast-forward sequence model, and temporal residual neural networks. Experimental results on the challenging YouTube-8M dataset demonstrate that our proposed approaches significantly improve on existing temporal modeling approaches for large-scale video recognition. Notably, our fast-forward LSTM with a depth of 7 layers achieves 82.75% in terms of GAP@20 on the Kaggle public test set.
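
The GAP@20 metric quoted in the abstract is not defined here. As a minimal sketch, assuming the challenge's global average precision definition (each video contributes its 20 highest-scoring predictions, all predictions are pooled and ranked by confidence, and precision is accumulated at every correct prediction and normalized by the total number of ground-truth labels), it could be computed roughly as follows. The function name `gap_at_k` and the input layout are hypothetical and chosen only for illustration; this is not the authors' or the challenge's evaluation code.

```python
def gap_at_k(scores, labels, k=20):
    """Global average precision over the top-k predictions of each video.

    scores: list of per-video lists, scores[v][c] = confidence for class c
    labels: list of per-video sets of ground-truth class ids
    (Both layouts are assumptions made for this sketch.)
    """
    pooled = []           # (confidence, is_correct) pairs from all videos
    total_positives = 0   # total number of ground-truth labels

    for video_scores, video_labels in zip(scores, labels):
        total_positives += len(video_labels)
        # Keep only the k highest-scoring classes for this video.
        ranked = sorted(range(len(video_scores)),
                        key=lambda c: video_scores[c], reverse=True)
        for c in ranked[:k]:
            pooled.append((video_scores[c], c in video_labels))

    if total_positives == 0:
        return 0.0

    # Rank the pooled predictions by confidence, highest first.
    pooled.sort(key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, correct) in enumerate(pooled, start=1):
        if correct:
            hits += 1
            # Precision at this rank; recall rises by 1 / total_positives.
            precision_sum += hits / rank
    return precision_sum / total_positives


if __name__ == "__main__":
    toy_scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.3]]
    toy_labels = [{0}, {1, 2}]
    print(gap_at_k(toy_scores, toy_labels, k=2))
```

Pooling predictions across videos before ranking means the metric rewards well-calibrated confidences across the whole test set, not only a good ordering within each video.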


Similar resources

Large-Scale YouTube-8M Video Understanding with Deep Neural Networks

The video classification problem has been studied for many years. The success of Convolutional Neural Networks (CNN) in image recognition tasks gives researchers a powerful incentive to create more advanced video classification approaches. Because video has temporal content, Long Short-Term Memory (LSTM) networks become a handy tool for modeling long-term temporal cues. Both approaches need a large...


An Effective Way to Improve YouTube-8M Classification Accuracy in Google Cloud Platform

Large-scale datasets have played a significant role in the progress of neural networks and deep learning. YouTube-8M is such a benchmark dataset for general multi-label video classification. It was created from over 7 million YouTube videos (450,000 hours of video) and includes video labels from a vocabulary of 4716 classes (3.4 labels/video on average). It also comes with pre-extracted audio &...


The Monkeytyping Solution to the YouTube-8M Video Understanding Challenge

This article describes the final solution of team monkeytyping, who finished in second place in the YouTube-8M video understanding challenge. The dataset used in this challenge is a large-scale benchmark for multi-label video classification. We extend the work in [1] and propose several improvements for frame sequence modeling. We propose a network structure called Chaining that can better ca...


YouTube-8M: A Large-Scale Video Classification Benchmark

Many recent advancements in Computer Vision are attributed to large datasets. Open-source software packages for Machine Learning and inexpensive commodity hardware have reduced the barrier of entry for exploring novel approaches at scale. It is possible to train models over millions of examples within a few days. Although large-scale datasets exist for image understanding, such as ImageNet, the...


Learnable pooling with Context Gating for video classification

Common video representations often deploy an average or maximum pooling of pre-extracted frame features over time. Such an approach provides a simple means to encode feature distributions, but is likely to be suboptimal. As an alternative, we here explore combinations of learnable pooling techniques such as Soft Bag-of-words, Fisher Vectors, NetVLAD, GRU and LSTM to aggregate video features ove...



Journal title:
  • CoRR

Volume abs/1707.04555

Publication date 2017